In this part of the course, we will cover the following concepts:
| Objective | Complete |
|---|---|
| Summarize the concepts of distance matrix and network visualization | |
| Create a distance matrix for a given dataset | |
Network visualization is an excellent way of understanding the relationships between individual observations or groups of observations in your data
A network is a collection of connected objects, called nodes, and the links between them
A network graph may also be known as a link chart, a node-link diagram, or a network map
In network theory, we use the distance between two nodes to describe their connection
The smaller the distance, the more similar the two nodes are
Depending on the situation, different distance metrics might be useful
In this module, we will use the Euclidean distance metric
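As a minimal sketch of the Euclidean metric, the snippet below computes the distance between two observations by hand and compares it with base R's dist(). The vectors a and b are made-up values for illustration, not rows of the course dataset.

```r
# Two hypothetical observations with three numeric features each.
a = c(1, 2, 3)
b = c(4, 6, 3)

# Euclidean distance: square root of the sum of squared differences.
manual_distance = sqrt(sum((a - b)^2))

# dist() uses the Euclidean metric by default (method = "euclidean").
dist_distance = as.numeric(dist(rbind(a, b)))

manual_distance
dist_distance
```

Both values should agree, since dist() applies the same formula to every pair of rows.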
A distance matrix for N nodes is of size N * N, where each value corresponds to the distance between a pair of nodes

| Objective | Complete |
|---|---|
| Summarize the concepts of distance matrix and network visualization | ✔ |
| Create a distance matrix for a given dataset | |
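
To make the N * N shape concrete, here is a small sketch with three made-up points (the toy_points matrix is an assumption for illustration). as.matrix() expands the result of dist() into the full square matrix.

```r
# Three hypothetical nodes described by two numeric features.
toy_points = rbind(c(0, 0), c(3, 4), c(6, 8))
rownames(toy_points) = c("n1", "n2", "n3")

# Full 3 x 3 distance matrix; entry [i, j] is the distance between node i and node j.
toy_matrix = as.matrix(dist(toy_points))

dim(toy_matrix)            # one row and one column per node
toy_matrix["n1", "n2"]     # distance between n1 and n2
isSymmetric(toy_matrix)    # distance from i to j equals distance from j to i
diag(toy_matrix)           # every node is at distance 0 from itself
```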
Rather than create a static network graph, we will render ours as an HTML widget
The htmlwidgets package provides a framework for quickly creating R bindings to JavaScript libraries
HTML widgets are helpful because they can be:
Examples of htmlwidgets are:

- leaflet for maps
- dygraphs for time series
- rthreejs for interactive 3D graphics

We will use visNetwork, which is built on htmlwidgets, to create an HTML widget for network visualization
visNetwork expects a nodes data frame with an id column and an edges data frame with from and to columns
In order to maximize the efficiency of your workflow, use the box package and encode your directory structure into variables
Let the main_dir be the variable corresponding to your materials folder
Datasets are stored in the data directory inside the materials folder in your environment; hence we will save their path to a data_dir variable
Plots are saved to the plots directory, which corresponds to the plot_dir variable
To build these paths, use the paste0 command and pass the strings you would like to paste together
# Read in the healthcare-dataset-stroke dataset.
hds = read.csv(file = file.path(data_dir, "healthcare-dataset-stroke-data.csv"), #<- provide file path (file.path adds the separator)
               header = TRUE,            #<- if file has header set to TRUE
               stringsAsFactors = FALSE) #<- read strings as characters, not as factors
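
The directory variables used above could be set up along the following lines. This is only a sketch: the location given to main_dir is a placeholder, and the course's own setup may differ.

```r
# Placeholder location of the materials folder; replace with your own path.
main_dir = file.path("~", "materials")

# Paths to the data and plots directories inside the materials folder.
data_dir = paste0(main_dir, "/data")
plot_dir = paste0(main_dir, "/plots")

# file.path() inserts the separator itself, so the file name needs no leading slash.
example_file = file.path(data_dir, "healthcare-dataset-stroke-data.csv")
example_file
```
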
# View the dimensions of the dataset.
dim(hds)
[1] 5110   12
id gender age hypertension heart_disease ever_married work_type
1 9046 Male 67 0 1 Yes Private
2 51676 Female 61 0 0 Yes Self-employed
3 31112 Male 80 0 1 Yes Private
4 60182 Female 49 0 0 Yes Private
5 1665 Female 79 1 0 Yes Self-employed
Residence_type avg_glucose_level bmi
1 Urban 228.69 36.6
2 Rural 202.21 NA
3 Rural 105.92 32.5
4 Urban 171.23 34.4
5 Rural 174.12 24.0
# Load dplyr for the pipe operator and select().
library(dplyr)
# Subset a few columns.
hds_small = hds %>%
  select(age, avg_glucose_level,
         bmi, stroke)
# View the first few rows of the dataset.
head(hds_small)
age avg_glucose_level bmi stroke
1 67 228.69 36.6 1
2 61 202.21 NA 1
3 80 105.92 32.5 1
4 49 171.23 34.4 1
5 79 174.12 24.0 1
6 81 186.21 29.0 1
# Convert BMI column to numerical
hds_small$bmi <- as.numeric(as.character(hds_small$bmi))
# Remove rows with NA.
hds_small = na.omit(hds_small)
# We keep only the unique rows since duplicate rows would have a distance of 0.
hds_small = unique(hds_small)
head(hds_small)
age avg_glucose_level bmi stroke
1 67 228.69 36.6 1
3 80 105.92 32.5 1
4 49 171.23 34.4 1
5 79 174.12 24.0 1
6 81 186.21 29.0 1
7 74 70.09 27.4 1
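
The effect of the two cleaning steps can be seen on a tiny made-up data frame (toy below is hypothetical, not the stroke data). A duplicated row sits at distance 0 from its twin, which is why it is dropped before the distance matrix is built.

```r
# Hypothetical data frame with one missing value and one duplicated row.
toy = data.frame(age = c(67, 61, 80, 67),
                 bmi = c(36.6, NA, 32.5, 36.6))

# Rows 1 and 4 are identical, so their distance is exactly 0.
as.matrix(dist(toy))[1, 4]

# Drop the row with NA, then drop the duplicate.
toy_clean = unique(na.omit(toy))
toy_clean

# After cleaning, no pair of rows is at distance 0.
min(dist(toy_clean))
```
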
To create the distance matrix, we will use the dist() function
dist() takes a numeric matrix or a dataframe as input
Since our variables, such as age and avg_glucose_level, are at very different scales, we must normalize the distance to values between 0 and 1 for easy interpretation
# Create distance matrix.
hds_distance = dist(hds_small)
# `dist` returns the lower triangle of the distance matrix as a vector.
head(hds_distance)
[1] 123.52442  60.25356  57.27691  45.36861 159.02075 135.02196
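
As a sketch of what that vector contains: for N rows, dist() stores N * (N - 1) / 2 values, one per unordered pair, and as.matrix() rebuilds the square form. The four points below are invented for illustration.

```r
# Four hypothetical one-dimensional points.
toy_values = c(a = 0, b = 1, c = 3, d = 7)
toy_dist = dist(toy_values)

# One value per pair of points: 4 * 3 / 2 = 6 distances.
length(toy_dist)

# Values are stored column by column from the lower triangle:
# (b,a), (c,a), (d,a), (c,b), (d,b), (d,c).
as.numeric(toy_dist)

# The same values sit below the diagonal of the full matrix.
toy_full = as.matrix(toy_dist)
toy_full[lower.tri(toy_full)]
```
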
# Normalize the distances to values between 0 and 1.
hds_distance = hds_distance/max(hds_distance)
head(hds_distance)
[1] 0.5447699 0.2657315 0.2526038 0.2000855 0.7013166 0.5954766
# Convert the normalized distances to similarities (1 - distance).
head(1 - hds_distance)
[1] 0.4552301 0.7342685 0.7473962 0.7999145 0.2986834 0.4045234
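
A toy version of the normalization step, using invented points, suggests how dividing by the maximum rescales distances so that none exceeds 1 while preserving their order. Subtracting from 1 then turns the normalized distance into a similarity, so that larger values mean closer nodes.

```r
# Pairwise distances between four hypothetical one-dimensional points.
toy_dist = dist(c(0, 1, 3, 7))

# Divide by the largest distance so that the largest value becomes 1.
toy_norm = toy_dist / max(toy_dist)
range(toy_norm)

# The ordering of the pairs is unchanged by the rescaling.
identical(order(as.numeric(toy_dist)), order(as.numeric(toy_norm)))

# Similarity: 1 for identical points, 0 for the most distant pair.
toy_similarity = 1 - toy_norm
round(as.numeric(toy_similarity), 3)
```
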
| Objective | Complete |
|---|---|
| Summarize the concepts of distance matrix and network visualization | ✔ |
| Create a distance matrix for a given dataset | ✔ |
You are now ready to try tasks 1-2 in the Exercise for this topic